There are various administrative tasks that need to be performed when a new node is added to the cluster. This document describes the tasks that should be performed in different scenarios. Cassandra is a key/value store that uses rows and columns. Each row in a cluster must have a unique key. Cluster nodes form a token ring using consistent hashing where each node is responsible for a range of keys on the token ring. Prior to Cassandra 1.2, when a node was added to the cluster, tokens would have to be manually recalculated and reassigned to each node. Then maintenance operations would have to be carried out to rebalance the cluster. With virtual nodes, introduced in Cassandra 1.2, these tasks are no longer necessary.
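To make the token ring idea concrete, here is a minimal sketch of how a row key could be mapped to the node that owns it. The `TokenRing` class, the node names, and the use of MD5 are assumptions made for illustration only; this is not Cassandra's partitioner code.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 128  # MD5 produces 128-bit values


def token_for(key: str) -> int:
    """Hash a row key to a position on the ring (MD5 used only for illustration)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class TokenRing:
    """Hypothetical ring: each node owns the range ending at its token."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to its token on the ring.
        self._ring = sorted((token, node) for node, token in node_tokens.items())
        self._tokens = [token for token, _ in self._ring]

    def owner_of_token(self, token: int) -> str:
        # The owner is the first node whose token is >= the given token,
        # wrapping around to the first node past the end of the ring.
        idx = bisect.bisect_left(self._tokens, token)
        return self._ring[idx % len(self._ring)][1]

    def owner(self, key: str) -> str:
        return self.owner_of_token(token_for(key))


# Three nodes spaced evenly around the ring.
ring = TokenRing({
    "node1": 0,
    "node2": RING_SIZE // 3,
    "node3": 2 * RING_SIZE // 3,
})
```

Adding a fourth node with its own token only takes keys away from the node that follows it on the ring; every other key keeps its owner. That property is what limits data movement when a node joins.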
There are a number of administrative tasks that need to be performed.
Verify that the new node has successfully joined the cluster
Increase replication factor of system_auth keyspace
Run repair on system_auth keyspace
I think this needs to be run on each node
Repair should be run sequentially since it is a resource-intensive operation
Run cleanup on system_auth keyspace
Increase replication factor of rhq keyspace
Run repair on rhq keyspace
Run cleanup on rhq keyspace
Update seeds lists
Verify that the ring is balanced
Each of these tasks, and why it needs to be performed, is discussed below.
We cannot perform the other tasks until the new node is fully bootstrapped, meaning that it has joined the ring, has the latest schema (which is propagated from the first node), and has its own token ranges. This should be done from the server since it is a cluster-wide check.
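One way the server could check ring membership is to inspect the output of `nodetool status`, where a state of `UN` means Up/Normal. The sketch below assumes the usual two-letter status/state prefix on each node line; the sample output, addresses, and host IDs are made-up placeholders, and the function names are hypothetical. It does not check schema agreement, which would need a separate check.

```python
import re

# A node line starts with Up/Down then Normal/Leaving/Joining/Moving, then the address.
_NODE_LINE = re.compile(r"^([UD])([NLJM])\s+(\S+)\s")

# Illustrative sample only; real host IDs are UUIDs.
SAMPLE_STATUS = """\
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID    Rack
UN  127.0.0.1  47.66 KB  256     50.0%  host-id-1  rack1
UJ  127.0.0.2  12.01 KB  256     50.0%  host-id-2  rack1
"""


def parse_node_states(status_output: str) -> dict:
    """Return a mapping of node address to its two-letter state, e.g. 'UN'."""
    states = {}
    for line in status_output.splitlines():
        match = _NODE_LINE.match(line)
        if match:
            states[match.group(3)] = match.group(1) + match.group(2)
    return states


def has_joined(status_output: str, address: str) -> bool:
    """True only when the node is listed as Up and Normal."""
    return parse_node_states(status_output).get(address) == "UN"
```

In the sample, `127.0.0.2` is still joining (`UJ`), so `has_joined` returns `False` for it and the remaining tasks would have to wait.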
If we keep the RF of the system_auth keyspace at 1 and the node that contains the user data goes down, clients will not be able to make new connections to the cluster. It is therefore imperative to increase the RF to avoid that scenario. This will be done from the server since it is a cluster-wide change.
Increase the RF of the rhq keyspace to 2. This will be done from the server since it is a cluster-wide change.
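The replication factor changes described above could be issued as CQL `ALTER KEYSPACE` statements. The following is a minimal sketch that only builds the statement text. The helper name is hypothetical, and `SimpleStrategy` is an assumption; a cluster that uses a different replication strategy would need a different options map.

```python
import re


def alter_rf_statement(keyspace: str, replication_factor: int) -> str:
    """Build a CQL statement that sets the replication factor of a keyspace.

    Assumes SimpleStrategy, which the original text does not specify.
    """
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", keyspace):
        raise ValueError("unexpected keyspace name: %r" % keyspace)
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return (
        "ALTER KEYSPACE {} WITH replication = "
        "{{'class': 'SimpleStrategy', 'replication_factor': {}}};"
    ).format(keyspace, replication_factor)


# One statement per keyspace discussed in this document.
statements = [alter_rf_statement(ks, 2) for ks in ("system_auth", "rhq")]
```

Changing the replication factor only updates metadata; existing data is not copied to the new replica until repair is run.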
This should be done from the server since it is a cluster-wide change.
We need to run repair after increasing the RF to ensure both nodes are consistent with respect to the system_auth keyspace. This should be done on the agent.
I do not think that repair needs to be run on the new node, but this needs to be verified. We do not want to run repair if we do not have to because it is a CPU and IO intensive operation.
This is performed by the agent and is necessary to ensure nodes are consistent with respect to the rhq keyspace. Again, I do not think we need to run repair on the new node.
This is run on the agent side on both nodes. Seeds lists have to be updated so that nodes know where to join the cluster at startup.
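Because repair is resource intensive, one option is to run it on one node at a time and wait for each run to finish before starting the next. The sketch below illustrates that ordering using `nodetool -h <host> repair <keyspace>`. The function names are hypothetical, and which hosts to include is left to the caller, since whether the new node needs repair is still an open question here.

```python
import subprocess


def repair_command(host: str, keyspace: str) -> list:
    """Build the nodetool invocation that repairs one keyspace on one node."""
    return ["nodetool", "-h", host, "repair", keyspace]


def repair_sequentially(hosts, keyspace, run=subprocess.check_call):
    """Repair the keyspace on each host in turn, never in parallel.

    `run` is injectable so the ordering can be exercised without a cluster;
    the default blocks until nodetool exits and raises on a non-zero status.
    """
    executed = []
    for host in hosts:
        command = repair_command(host, keyspace)
        run(command)  # Returns only after this node's repair has finished.
        executed.append(command)
    return executed
```

For example, `repair_sequentially(["127.0.0.1"], "rhq")` would repair only the first node, which matches the assumption that the new node does not need it.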
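As an illustration of the seeds update, the sketch below rewrites the `seeds` entry of a `cassandra.yaml` file held in memory. It assumes the file uses the stock `SimpleSeedProvider` layout with a single quoted, comma-separated `seeds` value; the function name and addresses are hypothetical.

```python
import re

# Matches the list item that holds the quoted seeds value.
_SEEDS_LINE = re.compile(r'^([ \t]*-[ \t]*seeds:[ \t]*)"[^"]*"', re.MULTILINE)


def update_seeds(yaml_text: str, seed_addresses) -> str:
    """Return yaml_text with the seeds value replaced by the given addresses."""
    seeds = ",".join(seed_addresses)
    new_text, count = _SEEDS_LINE.subn(
        lambda match: match.group(1) + '"' + seeds + '"', yaml_text
    )
    if count != 1:
        raise ValueError("expected exactly one seeds entry, found %d" % count)
    return new_text


# Illustrative fragment of cassandra.yaml.
SAMPLE_YAML = """\
seed_provider:
    - class_name: org.apache.cassandra.locator.SimpleSeedProvider
      parameters:
          - seeds: "127.0.0.1"
"""

updated = update_seeds(SAMPLE_YAML, ["127.0.0.1", "127.0.0.2"])
```

The same updated list would be written on both nodes so that either one can find the other at startup.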
Removes keys that the node no longer owns. This is done on the agent and only on the first node.
This is a brief check to make sure that the previous operations worked correctly. This should be done by the server since it is a cluster-wide check.
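A simple form of this check is to compare the ownership percentages that `nodetool status` reports for each node. The sketch below is one possible approach: the function names and the 5 percentage point tolerance are assumptions, and it relies on the `Owns` column being printed as a number followed by `%`.

```python
def parse_ownership(status_output: str) -> dict:
    """Return a mapping of node address to its ownership percentage."""
    owns = {}
    for line in status_output.splitlines():
        fields = line.split()
        # Node lines start with a two-letter status/state code such as UN.
        if len(fields) < 3 or len(fields[0]) != 2:
            continue
        if fields[0][0] not in "UD" or fields[0][1] not in "NLJM":
            continue
        for field in fields[2:]:
            if field.endswith("%"):
                owns[fields[1]] = float(field[:-1])
                break
    return owns


def is_balanced(owns: dict, tolerance: float = 5.0) -> bool:
    """True when the largest and smallest shares differ by at most `tolerance` points."""
    if not owns:
        return False
    values = list(owns.values())
    return max(values) - min(values) <= tolerance
```

With virtual nodes the shares are rarely identical, so some tolerance is needed; the right threshold depends on the cluster and is not specified here.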